Polynomial Time Approximation Schemes for Clustering in Low Highway Dimension Graphs
We study clustering problems such as k-Median, k-Means, and Facility Location in graphs of low highway dimension, a graph parameter modeling transportation networks. It was previously shown that approximation schemes for these problems exist which either run in quasi-polynomial time (assuming constant highway dimension) [Feldmann et al. SICOMP 2018] or run in FPT time (parameterized by the number of clusters k, the highway dimension, and the approximation factor) [Becker et al. ESA 2018, Braverman et al. 2020]. In this paper we show that a polynomial-time approximation scheme (PTAS) exists (assuming constant highway dimension). We also show that the considered problems are NP-hard on graphs of highway dimension 1.
Experimental Evaluation of Fully Dynamic k-Means via Coresets
For a set of points in ℝ^d, the Euclidean k-means problem consists of finding k centers such that the sum of squared distances from each data point to its closest center is minimized. Coresets are one of the main tools developed recently to solve this problem in a big-data context. They allow one to compress the initial dataset while preserving its structure: running any algorithm on the coreset provides a guarantee almost equivalent to running it on the full data.
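The objective and the coreset guarantee can be stated concretely. The sketch below (illustrative, not taken from the paper) evaluates the weighted k-means cost, so the same function scores both the full data (unit weights) and a weighted coreset:

```python
import numpy as np

def kmeans_cost(points, centers, weights=None):
    """Sum of squared distances from each (weighted) point to its nearest center."""
    points = np.asarray(points, float)
    centers = np.asarray(centers, float)
    if weights is None:
        weights = np.ones(len(points))
    # pairwise squared Euclidean distances, shape (n_points, n_centers)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float((np.asarray(weights, float) * d2.min(axis=1)).sum())

# An eps-coreset (S, w) guarantees that kmeans_cost(S, C, w) is within a
# (1 +/- eps) factor of kmeans_cost(data, C) for EVERY candidate center set C.
```

The "almost equivalent guarantee" in the abstract is exactly this uniform (1 ± ε) cost preservation over all center sets.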
In this work, we study coresets in a fully dynamic setting: points are added and deleted, with the goal of efficiently maintaining a coreset from which a k-means solution can be computed. Based on an algorithm of Henzinger and Kale [ESA'20], we present an efficient and practical implementation of a fully dynamic coreset algorithm that improves the running time by up to a factor of 20 compared to our non-optimized implementation of the algorithm by Henzinger and Kale, without sacrificing more than 7% of the quality of the k-means solution.
Comment: Accepted at ALENEX 2
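The fully dynamic interface described above can be sketched as follows. This toy class is only an illustration of the insert/delete/query contract (it serves a uniform sample as a weighted summary); the actual algorithm of Henzinger and Kale maintains a provable coreset via a far more sophisticated structure:

```python
import random

class DynamicCoreset:
    """Toy fully dynamic point set exposing insert, delete, and a weighted
    summary. Illustrative interface only, NOT the Henzinger-Kale algorithm."""

    def __init__(self, size=100, seed=0):
        self.size = size            # target summary size
        self.points = {}            # id -> point
        self.next_id = 0
        self.rng = random.Random(seed)

    def insert(self, p):
        self.points[self.next_id] = p
        self.next_id += 1

    def delete(self, pid):
        self.points.pop(pid, None)

    def coreset(self):
        pts = list(self.points.values())
        if len(pts) <= self.size:
            return [(p, 1.0) for p in pts]
        sample = self.rng.sample(pts, self.size)
        w = len(pts) / self.size    # each sampled point stands in for w points
        return [(p, w) for p in sample]
```

A k-means solver is then run on the weighted summary after each update batch, rather than on the full point set.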
A Quasi-Polynomial-Time Approximation Scheme for Vehicle Routing on Planar and Bounded-Genus Graphs
The Capacitated Vehicle Routing problem is a generalization of the Traveling Salesman problem in which a set of clients must be visited by a collection of capacitated tours. Each tour can visit at most Q clients and must start and end at a specified depot. We present the first approximation scheme for Capacitated Vehicle Routing for non-Euclidean metrics. Specifically, we give a quasi-polynomial-time approximation scheme for Capacitated Vehicle Routing with fixed capacities on planar graphs. We also show how this result can be extended to bounded-genus graphs and polylogarithmic capacities, as well as to variations of the problem that include multiple depots and charging penalties for unvisited clients.
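For concreteness, a solution to this problem is a set of tours, each starting and ending at the depot and visiting at most Q clients, and its cost is the total tour length. A minimal checker (a sketch under assumed inputs, not part of the paper):

```python
import math

def cvrp_cost(depot, tours, Q, dist=None):
    """Total length of a set of capacitated tours, each depot -> clients -> depot.
    Raises if any tour exceeds capacity Q. `dist` defaults to Euclidean distance."""
    if dist is None:
        dist = math.dist
    total = 0.0
    for tour in tours:
        if len(tour) > Q:
            raise ValueError(f"tour visits {len(tour)} > Q = {Q} clients")
        stops = [depot, *tour, depot]   # close the tour at the depot
        total += sum(dist(a, b) for a, b in zip(stops, stops[1:]))
    return total
```

An approximation scheme returns, for any fixed ε, a feasible set of tours whose `cvrp_cost` is at most (1 + ε) times optimal.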
Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation
In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic k-median and k-means problems, there is no known deterministic dimensionality-reduction procedure or coreset construction that avoids an exponential dependency on the input dimension d, the precision parameter ε, or k. Furthermore, there is no coreset construction that succeeds with probability 1 and whose size does not depend on the number of input points, n. This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman, WIREs Data Mining Knowl. Discov.'20]. Similarly, the best approximation ratios achievable deterministically without a complexity exponential in the dimension, even when allowing a complexity FPT in the number of clusters k, stand in sharp contrast with the (1+ε)-approximation achievable for both k-median and k-means in that case when allowing randomization.
In this paper, we provide deterministic sketch constructions for clustering whose size bounds are close to the best known randomized ones. We also construct a deterministic algorithm for computing a (1+ε)-approximation to k-median and k-means in high-dimensional Euclidean spaces, in time close to the best randomized complexity.
Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling and immediately improves over the recent results of [Braverman et al. FOCS'22].
Comment: FOCS 2023. Abstract reduced for arXiv requirement.
Near-linear time approximation schemes for clustering in doubling metrics
We consider the classic Facility Location, k-Median, and k-Means problems in metric spaces of constant doubling dimension. We give the first nearly linear-time approximation schemes for each problem, making a significant improvement over the state-of-the-art algorithms. Moreover, we show how to extend the techniques used to get the first efficient approximation schemes for the problems of prize-collecting k-Median and k-Means, and efficient bicriteria approximation schemes for k-Median with outliers, k-Means with outliers, and k-Center.
A New Coreset Framework for Clustering
Given a metric space, the (k, z)-clustering problem consists of finding k centers such that the sum, over every point, of its distance to its closest center raised to the power z is minimized. This encapsulates the famous k-median (z = 1) and k-means (z = 2) clustering problems. Designing small-space sketches of the data that approximately preserve the cost of solutions, also known as coresets, has been an important research direction over the last 15 years.
In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean spaces, doubling metrics, and minor-free metrics to the general metric case.
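The (k, z) objective can be written in a few lines; a small illustrative sketch showing how the two classic problems are special cases:

```python
import math

def clustering_cost(points, centers, z):
    """(k, z)-clustering cost: sum over points of (distance to nearest center)^z."""
    return sum(min(math.dist(p, c) for c in centers) ** z for p in points)

# z = 1 recovers the k-median objective, z = 2 recovers k-means.
```

A coreset for (k, z)-clustering is a small weighted point set on which this cost is preserved up to a (1 ± ε) factor for every choice of k centers.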
Differential Privacy for Clustering Under Continual Observation
We consider the problem of privately clustering a dataset in ℝ^d that undergoes both insertions and deletions of points. Specifically, we give an (ε, δ)-differentially private clustering mechanism for the k-means objective under continual observation. This is the first approximation algorithm for that problem with an additive error that depends only logarithmically on the number of updates. The multiplicative error is almost the same as in the non-private setting. To achieve this, we show how to perform dimension reduction under continual observation and combine it with a differentially private greedy approximation algorithm for k-means. We also partially extend our results to the k-median problem.
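To illustrate where the additive error in such mechanisms comes from (this is the standard noisy-average building block of differentially private k-means, not the paper's continual-observation mechanism, and the calibration of the noise scale to (ε, δ) is omitted):

```python
import numpy as np

def private_centroid(points, sigma, rng):
    """Release a cluster centroid with Gaussian noise on the sum and the count,
    the usual primitive in (eps, delta)-DP k-means steps. `sigma` is assumed to
    be calibrated to the privacy budget elsewhere."""
    points = np.asarray(points, float)
    noisy_sum = points.sum(axis=0) + rng.normal(0.0, sigma, points.shape[1])
    noisy_count = max(1.0, len(points) + rng.normal(0.0, sigma))
    return noisy_sum / noisy_count

rng = np.random.default_rng(0)
c = private_centroid([[0.0, 0.0], [2.0, 2.0]], sigma=0.1, rng=rng)
```

The noise added to sums and counts is what produces the additive error term; the contribution of the paper is keeping that term only logarithmic in the number of updates.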
Fully Dynamic Consistent Facility Location
We consider classic clustering problems in fully dynamic data streams, where data elements can be both inserted and deleted. In this context, several parameters are of importance: (1) the quality of the solution after each insertion or deletion, (2) the time it takes to update the solution, and (3) how different consecutive solutions are. The question of obtaining efficient algorithms in this context for facility location, k-median, and k-means was raised in a recent paper by Hubert Chan et al. [WWW'18] and also appears as a natural follow-up to the online model with recourse studied by Lattanzi and Vassilvitskii [ICML'17] (i.e., in insertion-only streams).

In this paper, we focus on general metric spaces and mainly on the facility location problem. We give an arguably simple algorithm that maintains a constant-factor approximation with O(n log n) update time and total recourse O(n). This improves over the naive algorithm, which recomputes a solution at each time step and can take up to O(n^2) update time and O(n^2) total recourse. These bounds are nearly optimal: in a general metric space, inserting a point takes O(n) time to describe its distances to the other points, and we give a simple lower bound of O(n) for the recourse. Moreover, we generalize this result to the k-median and k-means problems: our algorithm maintains a constant-factor approximation in time Õ(n + k^2).

We complement our analysis with experiments showing that the cost of the solution maintained by our algorithm at any time t is very close to the cost of a solution obtained by quickly recomputing a solution from scratch at time t, while having a much better running time.
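As a reference point for the objective being maintained, here is a minimal evaluator for the (uncapacitated) facility location cost, a sketch and not the paper's algorithm:

```python
def facility_location_cost(open_facilities, clients, opening_cost, dist):
    """Facility Location objective: opening costs of the chosen facilities
    plus each client's distance to its nearest open facility."""
    opening = sum(opening_cost[f] for f in open_facilities)
    connection = sum(min(dist(c, f) for f in open_facilities) for c in clients)
    return opening + connection
```

The dynamic algorithm keeps this cost within a constant factor of optimal after every insertion or deletion, while reopening or closing only O(n) facilities in total (the recourse).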